Time Series Modeling

2017-04-28, Josh Montague

This is Part 2 of the "Time Series Modeling in Python" series.

In Part 1, we looked at data structures within the pandas library that make working with time series particularly convenient.

Here, in Part 2, we'll look at the some of the simplest methods for modeling a time series, making forecasts, and evaluating the accuracy of our models. We'll take advantage of both pandas and numpy functionality since they're both great for these sorts of tasks.


In [ ]:
import copy
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from sklearn.metrics import r2_score, mean_squared_error

In [ ]:
%matplotlib inline

plt.rcParams["figure.figsize"] = (8,6)

In [ ]:
# `unemploy.csv` is included in the repo - it's a small csv
! head misc/unemploy.csv

In [ ]:
# get data from local dir
data = pd.read_csv('misc/unemploy.csv', parse_dates=True, index_col=0)    

data.head()

In [ ]:
# this is for the data we have locally
data_name = 'UNEMPLOY'

< optional web data acquisition >

If you want to experiment with other easily accessible data...

# first, change this cell type to "code" ! pip install pandas-datareader
# choose a different data_name to grab any other data set by name, # which you can find here: https://fred.stlouisfed.org/ # and then update the `data_name` variable in the previous cell import pandas_datareader.data as web start = pd.to_datetime('2010-01-01') end = pd.to_datetime('2017-01-01') data = web.DataReader(data_name, 'fred', start, end) data.head()

</ data acquisition >



In [ ]:
# what does this data look like?
data.plot()

Simple models

Let's start with the simplest set of things we can possibly do in our forecasting model[1]. These models are computationally simple, require only a small amount of data to make predictions, and generally provide a baseline over which to compare more complicated models.

Generally, we will want some form of a forecast() method.

But we'll get to abstractions in a moment. First, let's start by just doing it directly.

Naive method ( + .shift())

Let's start with the most straightforward and obvious forecasting method possible: the next data point will be the same as the previous data point. This approach (reasonably) assumes that in many real world systems there is some form of inertia or momentum in the underlying processes that is greater than the associated random fluctuations.

In form, what we'd like is a DataFrame with a column of observed data and column of forecasted data.

To start, our forecasted column will be a shifted version of the existing values column. We can do that with pandas' .shift() method.


In [ ]:
# check the top of the dataset
data.head()

In [ ]:
# what does the shift() method do?
# (remember that pandas methods return a new df)
data.shift().head()

In [ ]:
# what happens at the end of the series?
data.tail()

In [ ]:
# 'periods=1' is the default first arg
data.shift(periods=1).tail()

Note that the shift() method effectively slides the index up (relative to the data), keeping the same index range but clipping or extending the data column as needed.

What if you want the data column to have the same range, but shift the index? For this case, the shift() method has a freq kwarg to use instead of the (implicit) periods kwarg.


In [ ]:
data.head()

In [ ]:
# a timely reminder of bit.ly/pd-offsets
data.shift(freq='2D').head()

Note that the data column stayed fixed, but we adjusted each value in the index according to the freq kwarg while maintaining the period e.g. one month.


In [ ]:
data.tail()

In [ ]:
data.shift(freq='1M').tail()

Ok, great, now that we can simply shift the existing data column; let's attach that our original frame as our first "forecast" series.


In [ ]:
# use a copy for our forecast frame
d = copy.deepcopy(data)

In [ ]:
# assign the "forecast" to be the 1-period shifted version of the "observed" column
d['forecast'] = data.shift(1)[data_name]

d.head()

We made a forecast!


In [ ]:
Image(filename='misc/kermit.png')

Now how do we assess our model?

It's always a good idea to do a visual inspection of the data. Both in the original of the data (time series), and also in the space of your truth vs predictions.


In [ ]:
# how does our forecast look by eye?
d.plot()

plt.title('naive')

Not so bad! Certainly better than a random number generator.

Another common way to inspect a prediction (in any supervised learning task) is to plot the true data against the predicted data and fit a line through it. In the case where prediction = truth, this will be a beautiful, straight line, with zero residual error across the entire data set.

If you do see a beautiful line like this, you most likely made a mistake somewhere; the real world is messy. Double check that you didn't accidentally train on your hold-out data, or evaluate on your training data, or something similar.

In the real world (if your model is good), your plot should appear roughly linear, and should have some variance around the line you would draw through the data.


In [ ]:
plt.scatter(d[data_name], d[data_name])

plt.xlabel('truth')
plt.ylabel('also truth')
plt.title('this will never happen');

In [ ]:
plt.scatter(d[data_name], d['forecast'])

plt.xlabel('truth')
plt.ylabel('forecast')
plt.title("variance is a sign that you're alive");

Looks pretty good! We'll come back to a more quantitative assessment of this prediction in a bit.

Before we move to other models, let's convert the naive model to a functional form. This way, as we develop more forecasting models, we can use the same API for consistency.

numpy > pandas

While we typically have our data in DataFrames, we'll see that most (all?) of our forecasting methods don't make use of the extra metadata in a DataFrame or Series. As a result, we'll define our forecasting models with the expectaion of numpy arrays as input (which are more general), and we'll know that we can always use the .values attribute to get the respective numpy array data out of a pandas object.

Here's an implementation of the naive model in functional form. It's a lot of typing for a simple model, but the pattern will prove useful.


In [ ]:
def fc_naive(data, **kwargs):
    """The 'naive' forecast of the next point in `data` (presumed to be 
    ordered in time) is equal to the last point observed in the series.
    
    `data` should be a 1-D numpy array
    
    Returns a single-valued forecast for the next value in the series.
    """
    forecast = data[-1]
    return forecast

We can use this function to create a similar DataFrame to the one we used earlier. We'll loop over the observed data, and use the past data as our input.


In [ ]:
# container for the forecast
forecasts = []

# loop over positions in the array
for idx in range(len(data[data_name])):
    # subset the series from beginning to position idx
    array_slice = data[data_name].iloc[:idx].values
    if idx < 10:
        # a view behind the scenes...
        print('iteration {}, array_slice: {}'.format(idx, array_slice))
    # make a forecast using that series
    try:
        forecasts.append( fc_naive(array_slice) )
    except IndexError:
        # the first position won't have a forecast value
        forecasts.append(np.nan)

In [ ]:
d = copy.deepcopy(data)
d['forecast'] = forecasts

d.head()

In [ ]:
d.plot()

plt.title('same naive graph');

NOW FOR MOAR MODELS!

Seasonal naive method

Instead of just forecasting based on the previous point, we might know there is a consistent cycle within the data. Often, the term "seasonal" is used to describe any sort of cyclic behavior. Seems sloppy, but oh well. In this case, we can prescribe how far back to look in the series, and forecast the corresponding value.


In [ ]:
def fc_snaive(data, n=7, **kwargs):
    """The 'seasonal naive' forecast of the next point in `data` (presumed to be 
    ordered in time) is equal to the point observed `n` points prior in the series.
    
    `data` should be a 1-D numpy array
    `n` should be an integer
    
    Returns a single-valued forecast for the next value in the series.
    """
    forecast = data[-n]
    return forecast

Note that we aren't paying any attention to the observed resolution of the data, only the relative position of any cyclic behavior.

Since we want to create a new forecast array (as before), let's make a function that encapsulates the iteration over an observed array and returns the corresponding forecast array.


In [ ]:
def forecast_series(observed, fc_func, **kwargs):
    """Returns an array of forecasted values (using `fc_func` and any `kwargs` like a window `n`) 
    for each value in the input np.array `observed`. 
    """
    # container for the forecast
    forecasts = []

    for idx in range(len(observed)):
        # subset the series from beginning to position idx
        array_slice = observed[:idx]
        # make a forecast using that series
        try:
            forecasts.append( fc_func(array_slice, **kwargs) )
        except IndexError:
            # the first position won't have a forecast value
            forecasts.append(np.nan) 
    return forecasts

In [ ]:
d = copy.deepcopy(data)

# our data is monthly, and i have a hunch about quarterly cycles, so let's use n=3 (3 months in a quarter)
forecasts = forecast_series(d[data_name].values, fc_snaive, n=3)

d['forecast'] = forecasts

d.head()

In [ ]:
d.plot()

plt.title('seasonal naive (n=3)')

In [ ]:
plt.scatter(d[data_name], d['forecast'])

plt.xlabel('truth')
plt.ylabel('forecast')
plt.title('seasonal naive method')

Mean method

Another straightforward model consists of averaging the past n points and forecasting using that mean value.


In [ ]:
def fc_mean(data, n=3, **kwargs):
    """The 'mean' forecast of the next point in `data` (presumed to be 
    ordered in time) is equal to the mean value of the most recent `n` observed points.
    
    `data` should be a 1-D numpy array
    `n` should be an integer
    
    Returns a single-valued forecast for the next value in the series.
    """
    # don't start averaging until we've seen n points
    if len(data[-n:]) < n:
        forecast = np.nan
    else:
        # nb: we'll keep the forecast as a float
        forecast = np.mean(data[-n:])
    return forecast

In [ ]:
d = copy.deepcopy(data)

# let's try a 4-point rolling mean
forecasts = forecast_series(d[data_name].values, fc_mean, n=4)

d['forecast'] = forecasts

d.head()

In [ ]:
d.plot()

plt.title('mean forecast (n=3)');

In [ ]:
plt.scatter(d[data_name], d['forecast'])

plt.xlabel('truth')
plt.ylabel('forecast')
plt.title('mean method')

Drift method

The last simple model we'll look at involves a linear extrapolation from recently observed data. In this case, the prediction is an adjustment from the most recent observed point, according to the slope extrapolated from two points: one at n points prior, and the most recent point.


In [ ]:
def fc_drift(data, n=3, **kwargs):
    """The 'drift' forecast of the next point in `data` (presumed to be 
    ordered in time) is a linear extrapoloation from `n` points ago, through the
    most recent point.
    
    `data` should be a 1-D numpy array
    `n` should be an integer
    
    Returns a single-valued forecast for the next value in the series.
    """
    yi = data[-n]
    yf = data[-1]
    slope = (yf - yi) / (n-1)
    forecast = yf + slope
    return forecast

In [ ]:
d = copy.deepcopy(data)

# let's try a 5-point drift method
forecasts = forecast_series(d[data_name].values, fc_drift, n=5)

d['forecast'] = forecasts

d.head()

In [ ]:
d.plot()

plt.title('drift method');

In [ ]:
plt.scatter(d[data_name], d['forecast'])

plt.xlabel('truth')
plt.ylabel('forecast')

Model accuracy metrics

At this point, we've looked at four different simple models: naive, seasonal naive, (rolling) mean, and drift. A next reasonable question is: which one best reflects our data? To answer that, we'll look to some metrics for model accuracy measurements.

As is often the case, scikit-learn has a module that supplies a handful of these out of the box. The first thing you'll note is that there are more metrics for classification accuracy than for regression accuracy evaluation. Still, at least we don't have to reinvent the wheel!

First, let's make some forecasts...


In [ ]:
d = copy.deepcopy(data)

# feel free to tweak the 'n' args
model_list = [('naive', fc_naive, 0),
              ('seasonal_naive', fc_snaive, 3),
              ('mean', fc_mean, 3),
              ('drift', fc_drift, 5)]

# create new cols for each model
for name, model, nn in model_list:
    d[name] = forecast_series(d[data_name].values, model, n=nn)

In [ ]:
d.head()

In [ ]:
d.plot()

plt.title('ALL THE FORECASTS!');

In [ ]:
for name, series_data in d.items():
    plt.plot(d[data_name], series_data, 'o', alpha=0.6, label=name)

plt.xlabel('truth')    
plt.ylabel('pred')
plt.title('another view')
plt.legend()

Another view on these charts to quantify the quality of a model is to look at the distribution of residuals.


In [ ]:
comparison = 'naive'

(d[data_name] - d[comparison]).hist(bins=30)

plt.xlabel('residuals')
plt.title('residual distribution for method: {}'.format(comparison));

They all look pretty good, but we can be more specific.

$R^2$ (coefficient of determination) score

The $R^2$ score is a common regression scoring method. The user guide (and wikipedia) have nice explanations of both the defintion, and the domain; in short, the maximum value is 1.0, scoring a constant prediction that is the mean value of the observed data will give 0.0, and it can be arbitraily negative). The $R^2$ metric is basically what your eyeballs are assessing when you look at a scatter plot of the truth data vs. the predicted data.

Let's look at the $R^2$ value for the models that we've introduced so far.


In [ ]:
print('* R2 scores (bigger = better) *\n')

# calculate R2 for each model (against the observed data)
for name, series_data in d.items():
    # strip rows with nans 
    subdf = d[[data_name, name]].dropna()
    truth = subdf[data_name].values
    pred = subdf[name].values
    # calculate metric
    r2 = r2_score(truth, pred)
    print('{} - {:.4f}'.format(name, r2))

Feel free to experiment (or BYO GridSearchCV), but I think you'll typically find these are all >0.95, and the naive model frequently has the highest value.

Mean squared error

The mean squared error (MSE) is a simpler calculation that is the expected value of the quadratic error.

$$ MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_i (y_i - \hat{y}_i)^2 $$

In [ ]:
print('* MAE scores (smaller = better) *\n')

# calculate MAE for each model (against the observed data)
for name, series_data in d.items():
    # strip rows with nans 
    subdf = d[[data_name, name]].dropna()
    truth = subdf[data_name].values
    pred = subdf[name].values
    # calculate metric
    mae = mean_squared_error(truth, pred)
    print('{} - {:.4f}'.format(name, mae))

Since MAE isn't normalized, it's a little hard to eyeball (big numbers), and to compare different models. There are also mean absolute, and median absolute errors, if you feel like you want to "penalize" outliers in any particular fashion.

I like $R^2$ because it's relateively straightforward, but it's good to know there are alternatives.

Wrap-up

While none of these models are particularly fancy, they provide great baselines for any other model you can dream up. They are generally very fast and straightforward to evaluate, and provide a surprisingly high accuracy on many forecasting tasks.

For some broader context on forecasting (if, sadly, written in R), check out Ref [1].

Hopefully, there are still two classes of time series modeling that we'll look at in the future: so-called "state space models" (of which the most famous is the ARIMA family), and more recent, neural network-based models. When we do get to these models, now we can baseline them against these simple methods!


pandas Appendix

There are a handful of pandas methods that are time-series related, but not necessarily about forecasting and this seemed like a good place to highlight them. In particular, if you'd like to transform a time series according to a rolling window average, or exponential smoothing, there are efficient built-in methods for that!


In [ ]:
# recall our original 'data' dataframe
data.head()

Window functions

The official docs include many examples, but a common pattern is creating a Rolling object - which has the notion of a window size, and a window type (square, triangular, etc.) - and then applying functions like you would after a groupby or a resample.


In [ ]:
# make a "rolling" object that we can use for calculations
r = data.rolling(window=5)

# this object can be treated much like a GroupBy object
r

In [ ]:
# we can apply a number of methods to the Rolling object, like standard numerical calcs
r.mean().head(10)

In [ ]:
plt.plot(data, 'o--', label=data_name)
plt.plot(r.mean(), '.-', label='rolling mean')

plt.legend()

In [ ]:
plt.plot(data, 'o--', label=data_name)
plt.plot(r.max(), '.-', label='rolling max')

plt.legend()

In [ ]:
# calculate stdev on the rolling object within window size
stds = r.std()

# add the stdevs as error bars on each point
data.plot(style='o', yerr=stds)

plt.title('observed data points + windowed stdev')
plt.legend();

Exponentially weighted windows

Finally, if you should want them, there are windows that have an exponentially decaying weight. These are relatively new to pandas, and havequestionable documentation. But, they're configured by either a span (much like the rolling window), the decay constant alpha, or a couple of related.


In [ ]:
plt.plot(data, 'o-', label=data_name)
plt.plot(data.ewm(span=5).mean(), '.--', label='EMW')

plt.legend();

References

[1] OText on Forecasting


In [ ]: